-
Notifications
You must be signed in to change notification settings - Fork 114
HNSW (Hierarchical Navigable Small Worlds) implementation as fdb-extension #3598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
8a3c53c
to
87c5674
Compare
fde3d47
to
597d08b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass of the vector logic, that is, the Vector
class, the Metric
and Metrics
classes, and the serialization logic in StorageAdapter
, along with the attendant tests. In theory, we could take these separately, which might allow us to think more about how we'd expose these through the SQL type system above
grpc = "1.64.1" | ||
grpc-commonProtos = "2.37.0" | ||
guava = "33.3.1-jre" | ||
half4j = "0.0.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is probably somewhat of an open question as to whether we want to add a new dependency to support fp16 encoding or not. It is certainly a bit of a shame that all consumers would need to take on the dependency, even if they do not end up using the HNSW index.
I could see the argument that we need yet another subproject here for vector indexes, and that's where we put the new code, like was done with fdb-record-layer-lucene
. Then projects that don't use the new vector sub-component don't need to worry about this.
I did also take a quick look, and it actually seems like writing our own fp32 to fp16 utility wouldn't be that hard. In the common case, it's just a matter of truncating the mantissa and re-encoding the exponent. There are some truncation edge cases (e.g., the numbers can blow up to infinity for a lot of exponent values) that aren't that bad, and then the only "hard" case is making sure that you can convert from a normal 32 (or 64) bit value to a sub-normal 16-bit value. But even that isn't too bad. I have some prototype code locally that could theoretically do that, using java short
values as the storage for 16-bit values. Getting the logic to go back and forth is actually not too bad; if we actually want to be able to do fp16 math on those shorts, that's a little more intense, though actually maybe not too too bad if we absolutely needed to.
appender.console.layout.pattern = %d [%level] %logger{1.} - %m %X%n%ex{full} | ||
|
||
rootLogger.level = debug | ||
rootLogger.level = info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this? Should we revert it?
*/ | ||
public abstract class Vector { | ||
@Nonnull | ||
final double[] data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason the base Vector
class wraps a double vector? Rather than, say, a FloatVector
wrapping a float[]
?
* Returns the number of elements in the vector. | ||
* @return the number of elements | ||
*/ | ||
public int size() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Should this be dimension
rather than size
? I don't think I actually have an opinion here, but it occurs to me as a decision we could make
* data type conversions and raw data representation. | ||
*/ | ||
public abstract class Vector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be the answer is "no", but is there a better name than vector? The main concern is that it conflicts with the abstract data type name. But it also appears to be the name that other systems use for this. The only alternative name I could think of would be something like FloatingPointVector
or like CartesianVector
or something, neither of which I'm all that happy with
private final Metric.ManhattanMetric manhattanMetric = new Metric.ManhattanMetric(); | ||
private final Metric.EuclideanMetric euclideanMetric = new Metric.EuclideanMetric(); | ||
private final Metric.EuclideanSquareMetric euclideanSquareMetric = new Metric.EuclideanSquareMetric(); | ||
private final Metric.CosineMetric cosineMetric = new Metric.CosineMetric(); | ||
private Metric.DotProductMetric dotProductMetric; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- Should this reference the
Metric
objects from theMetrics
enum? It seems weird to have that enum but still create instances of the metric implementations on their own - Why is
dotProductMetric
handled differently from all of the other metrics
} | ||
|
||
@Test | ||
public void manhattanMetricDistanceWithIdenticalVectorsShouldReturnZeroTest() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be parameterized on the Metric type? I believe all of the metrics should return zero on identical vectors except for dot product. The EnumSource
argument provider can be set up to exclude values, which we could do to skip that particular one
} | ||
|
||
@Test | ||
public void euclideanMetricDistanceWithDifferentPositiveVectorsShouldReturnCorrectDistanceTest() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This correct distance metric could also be made generic on the different types, though it would require a more involved arguments provider. Probably something like:
static Stream<Arguments> metricAndExpectedDistance() {
// Distance between (1.0, 2.0) and (4.0, 6.0)
return Stream.of(
Arguments.of(Metrics.MANHATTAN, 7.0), // |1 - 4| + |2 - 6| = 7
Arguments.of(Metrics.EUCLIDEAN, 5.0), // sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5.0
Arguments.of(Metrics.SQUARE_EUCLIDEAN, 25.0), // (1-4)^2 + (2-6)^2 = 9 + 16 = 25
// etc
);
}
|
||
import static org.junit.jupiter.api.Assertions.assertEquals; | ||
|
||
public class MetricTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also probably add tests of the comparative distance function, to be sure. I could also see us adding a test of the triangle inequality, for the two that support it.
import java.util.stream.LongStream; | ||
import java.util.stream.Stream; | ||
|
||
public class VectorTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would probably be a good spot to add a few more unit tests of the Vector
class, including things like the hash code and converting between HalfVector
and DoubleVector
This PR implements a HNSW (Hierarchical Navigable Small World) structure as an part of
fdb-extension
according to https://arxiv.org/pdf/1603.09320.Vector
that holds component data usingHalf
s which are FP16 encoded.M
the optimal number of connections between nodes within a layer of the HNSWMMax
the maximum number of connections between nodes within a layer of the HNSWMMax0
the maximum number of connections between nodes within layer0
of the HNSWefConstruction
the size of the search queue during thorough (non-greedy) searches during insertionextendCandidates
a boolean indicating if we should extend the candidates we can connect to to by also searching within the neighbors of already found candidateskeepPrunedConnections
a boolean indicating whether under certain circumstances discarded candidates to connect to are reevaluated